Long string optimization for string column parsing in JSON reader #13803
Conversation
Just a few small comments, looks great otherwise.
/ok to test

/ok to test
Some minor nitpicks; two of them, I think, want addressing before merge (even if only to convince me that there is "nothing to see here"). Specifically:
- possible integer overflow in the grid-stride loop (see the sketch after this list)
- a logic difference in the two places UTF-16 surrogate code points are handled
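
For context on the first point, the usual grid-stride hazard is computing the start index or stride in 32-bit arithmetic, which can wrap on large inputs. A minimal sketch of the overflow-safe form (illustrative only, not the PR's actual kernel):

```cpp
#include <cstddef>

__global__ void process_chars(char const* data, std::size_t n)
{
  // Widen to 64 bits *before* multiplying, so blockIdx.x * blockDim.x
  // cannot wrap a 32-bit int when the grid and input are large.
  auto const start  = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  auto const stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
  for (auto i = start; i < n; i += stride) {
    // ... process data[i] ...
  }
}
```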
```cpp
// This is indeed a UTF16 surrogate pair
if (hex_val >= UTF16_HIGH_SURROGATE_BEGIN && hex_val < UTF16_HIGH_SURROGATE_END &&
    hex_low_val >= UTF16_LOW_SURROGATE_BEGIN && hex_low_val < UTF16_LOW_SURROGATE_END) {
```
question: Is it valid to have two successive UTF-16 code points where the first is contained in [HIGH_SURROGATE_BEGIN, HIGH_SURROGATE_END) and the second is not contained in [LOW_SURROGATE_BEGIN, LOW_SURROGATE_END)? If not, then I think we are missing a parse error here for this case.
AFAIK, it's considered an error in UTF-16.
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

> Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal.

https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF_(surrogates)

> It is possible to unambiguously encode an unpaired surrogate (a high surrogate code point not followed by a low one, or a low one not preceded by a high one) in the format of UTF-16 by using a code unit equal to the code point. The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.
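
For reference, the quoted ranges translate directly into the standard surrogate checks and pair-decoding formula. A small sketch (helper names are hypothetical, not from this PR):

```cpp
#include <cstdint>

// Surrogate ranges from the quoted text: high = 0xD800–0xDBFF, low = 0xDC00–0xDFFF.
constexpr bool is_high_surrogate(uint32_t cu) { return cu >= 0xD800u && cu <= 0xDBFFu; }
constexpr bool is_low_surrogate(uint32_t cu) { return cu >= 0xDC00u && cu <= 0xDFFFu; }

// A valid pair combines into a code point in [U+10000, U+10FFFF]; each
// surrogate contributes 10 bits above its range base.
constexpr uint32_t decode_surrogate_pair(uint32_t high, uint32_t low)
{
  return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
}
```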
OK, so if I understand correctly, we want logic that is something like:

```cpp
hex_val = parse_unicode_hex(stream);
if (is_high_surrogate(hex_val)) {
  auto hex_val_low = parse_unicode_hex(stream);
  if (is_low_surrogate(hex_val_low)) {
    // valid surrogate pair, decode
  } else {
    // invalid surrogate pair, parse error
    // I think this case is missing
    return {bytes, parse_error};
  }
} else {
  // non-surrogate, decode as normal
}
```
But I think the parse error, in the case where the first code unit is a high surrogate and the second is not a low surrogate, is not handled. Except that maybe this is deliberate, per:

> The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.

So you're just "carrying on" and letting an invalid character appear in the output stream. Is that right?

If so, then:

suggestion: Add a comment explaining this parsing behaviour.
That's right. I added a comment at the place where this is skipped rather than thrown as an error.
/ok to test
Thanks for your hard work here!
/ok to test
Final benchmark: ~70–80% runtime reduction for long strings (overall range 30%–89%).

- json_read_string_column (single column), branch-23.10 vs this PR: a single column of max-length strings, JSON file size 512 MB, on [0] Quadro GV100
- json_read_string_column (multiple columns), branch-23.10 vs this PR: 64 columns in the JSON file, file size 512 MB, on [0] Quadro GV100
Thanks to all the reviewers for the great input! Many suggestions led to great performance improvements 🚀
/merge
Description
Closes #13724.

In the old code, one thread per string is allocated for parsing a string column. For longer strings (>1024 characters), one-thread-per-string decoding takes too long, even when there are only a few such strings.

In this change, one warp per string is used to parse strings of length <=1024, and one block per string for strings of length >1024. If the max string length is <128, one thread per string is used as before. Both kernels use 256 threads per block.

The code for one-warp-per-string and one-block-per-string is similar, varying only in the warp-wide vs. block-wide primitives used for the reduction and scan operations; shared memory usage also differs slightly. The sketch below illustrates the pattern.
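
As an illustration of that pattern (a sketch with assumed names, not the actual cuDF kernels), the warp and block variants can share the same per-string loop and differ only in the cooperative primitive used to combine per-thread results:

```cpp
#include <cub/cub.cuh>

constexpr int block_size = 256;

// One warp (32 threads) per string: each lane handles a strided slice of the
// string's characters; a warp-wide reduction combines the partial counts.
__global__ void size_warp_per_string(int const* offsets, int* sizes, int num_strings)
{
  using WarpReduce = cub::WarpReduce<int>;
  __shared__ typename WarpReduce::TempStorage storage[block_size / 32];

  int const warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  int const lane_id = threadIdx.x % 32;
  if (warp_id >= num_strings) return;

  int partial = 0;
  for (int i = offsets[warp_id] + lane_id; i < offsets[warp_id + 1]; i += 32) {
    partial += 1;  // stand-in for per-character work (e.g. bytes after unescaping)
  }
  int const total = WarpReduce(storage[threadIdx.x / 32]).Sum(partial);
  if (lane_id == 0) sizes[warp_id] = total;
}

// One block per string: the same structure, but the loop strides by blockDim.x
// and a block-wide reduction replaces the warp-wide one.
__global__ void size_block_per_string(int const* offsets, int* sizes, int num_strings)
{
  using BlockReduce = cub::BlockReduce<int, block_size>;
  __shared__ typename BlockReduce::TempStorage storage;

  int const str_id = blockIdx.x;
  if (str_id >= num_strings) return;

  int partial = 0;
  for (int i = offsets[str_id] + threadIdx.x; i < offsets[str_id + 1]; i += blockDim.x) {
    partial += 1;  // same stand-in as above
  }
  int const total = BlockReduce(storage).Sum(partial);
  if (threadIdx.x == 0) sizes[str_id] = total;
}
```

The two kernels are structurally identical; swapping the loop stride and the reduction primitive is the only change, which is why most of the parsing logic can be shared (a scan for output offsets would follow the same warp-wide vs. block-wide substitution).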
Checklist